Sequencing data processing for the SynCom co-culture experiment

Author

Marine C. Cambon

Document last compiled

February 5, 2026

Objective

The data were obtained from sequencing of the 16S rRNA gene V4 region, performed by Novogene on an Illumina platform (NovaSeq?) with PE250 reads. Sequences will be demultiplexed and then processed with dada2 within a {targets} workflow.

Setting up {renv}, {targets} and packages

I install the packages I need using {renv}, then initialise renv to create the renv.lock file, which records the package versions.

renv::install("bioc::dada2")
renv::install("targets")
renv::install("tarchetypes")
renv::install("quarto")
renv::init()
renv::snapshot()

I also create the _targets.R file which will contain the analysis workflow:

targets::use_targets()

Then we can visualise the workflow network

library(targets)
tar_visnetwork()

Read demultiplexing

The run contains multiplexed samples and only one primer pair. The full sequences of primers + spacer + tag are in data/tag_primers_16SV4_fwd.fasta and data/tag_primers_16SV4_rev.fasta.

The raw data were downloaded from Novogene on BlueBEAR at ~/cambonm-oak-syncoms/250730_SynCom_stability_experiment/01_raw_data

The MD5 checksum of the file is checked and the archive is then uncompressed:

md5sum -c MD5.txt
> X204SC24085297-Z01-F002.tar: OK

tar -xvf X204SC24085297-Z01-F002.tar
cd X204SC24085297-Z01-F002/01.RawData/pool_1/
md5sum -c MD5.txt 
> pool_1_FKDN250298699-1A_22NW5FLT4_L5_1.fq.gz: OK
> pool_1_FKDN250298699-1A_22NW5FLT4_L5_2.fq.gz: OK

Because I sent a pool of sequences to Novogene, which then ligated the Illumina adapters, fragments are inserted in random orientation and need to be demultiplexed in both orientations, using both the primer and the tag!

Tip

In later versions of cutadapt, there is a --revcomp option to automatically look for primers in both orientations.

I tested this option against my old loop-based version and got more or less the same unknown-unknown file sizes, so I'll just go ahead with the --revcomp option from now on.

Running cutadapt

I ran cutadapt v4.9 on BlueBEAR:

#!/bin/bash
#SBATCH -o cutadapt.revcomp.demultiplex.o.%J    # Job output file
#SBATCH -e cutadapt.revcomp.demultiplex.e.%J    # Job error file
#SBATCH --ntasks=4            # number of parallel processes (tasks)
#SBATCH --qos=bbdefault       # selected queue
#SBATCH --time=72:00:00        # time limit

set -e

module purge
module load bear-apps/2023a
module load cutadapt/4.9-GCCcore-12.3.0

## 16SV4
export input_dir=/rds/projects/c/cambonm-oak-syncoms/250730_SynCom_stability_experiment/01_raw_data/PE250/X204SC24085297-Z01-F003/01.RawData/pool_1/
export output_dir=16SV4_revcomp
mkdir $output_dir

# --cores=0 auto detects the number of available cores
# --revcomp searches for primers and adapters in both orientations

cutadapt \
    -e 0.15 --no-indels \
    --cores=0 \
    --revcomp \
    -g file:tag_primers_16SV4_fwd.fasta \
    -G file:tag_primers_16SV4_rev.fasta \
    -o $output_dir/{name1}-{name2}.R1.fastq.gz -p $output_dir/{name1}-{name2}.R2.fastq.gz \
    $input_dir/pool_1_1.fq.gz $input_dir/pool_1_2.fq.gz

Read number per sample after demultiplexing

I counted the number of reads per sample after demultiplexing with the following code:

#!/bin/bash
#SBATCH -o count.demultiplex.o.%J    # Job output file
#SBATCH -e count.demultiplex.e.%J    # Job error file
#SBATCH --ntasks=1            # number of parallel processes (tasks)
#SBATCH --qos=bbdefault       # selected queue
#SBATCH --time=00:05:00        # time limit

set -e

module purge
module load bear-apps/2023a/live

touch count_demultiplexed.txt

for f in 16SV4_revcomp/*R1.fastq.gz
do
    # A FASTQ record spans 4 lines; counting lines matching '@' would also
    # catch quality lines, so count every 4th line instead
    c=$(zcat "$f" | awk 'NR % 4 == 1' | wc -l)
    echo "$f $c" >> count_demultiplexed.txt
done
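One subtlety when counting FASTQ reads: matching '@' lines overcounts, because quality strings in Phred+33 encoding can contain or even start with '@' (it encodes Q31); counting every 4th line is robust. A small demonstration with a toy file:

```shell
# Build a 2-read FASTQ where one quality line starts with '@'
printf '@read1\nACGT\n+\n@@II\n@read2\nTTGG\n+\nIIII\n' > demo.fastq

grep -c @ demo.fastq                   # 3 lines contain '@' -- overcounts
awk 'NR % 4 == 1' demo.fastq | wc -l   # 2 records -- correct
```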

Then plot the results

tar_load(fig_read_counts_demultiplex)
fig_read_counts_demultiplex

Every single sample has more than 1000 reads, including the negative controls and unused tags! Because there were relatively few samples in the pool, the per-sample sequencing depth is very high. I will need to be quite stringent in the cleaning steps.

Note

Up to this point I ran the scripts manually and not within the targets workflow, because I am not too sure how to deal with running bash scripts on Slurm within targets, but I'll try to improve this in the future.

The {targets} workflow starts below ⬇️

Read quality filtering

First, the demultiplexed reads are filtered based on their quality. Sequences with unknown bases (Ns) or more than 2 expected errors are removed.

First we get the path of the demultiplexed files,

function(path, metadata) {
  # List forward and reverse read files; sort both so R1/R2 stay paired
  fnFs <- sort(list.files(path, pattern=".R1.fastq", full.names = TRUE))
  fnRs <- sort(list.files(path, pattern=".R2.fastq", full.names = TRUE))
  # Match file names to sample IDs from the metadata; drop unmatched files
  sample.names <- metadata$sample_ID[match(basename(fnFs), basename(metadata$file))]
  fnFs <- fnFs[!is.na(sample.names)]
  fnRs <- fnRs[!is.na(sample.names)]
  sample.names <- sample.names[!is.na(sample.names)]
  return(list("fnFs" = fnFs, "fnRs" = fnRs, "sample_names"=sample.names))
}

Then we filter the sequences and write them in the output folder.

function(fns_paths, outp) {
  filtFs <- file.path(outp, "01_filtered", paste0(fns_paths[["sample_names"]], "_F_filt.fastq.gz"))
  filtRs <- file.path(outp, "01_filtered", paste0(fns_paths[["sample_names"]], "_R_filt.fastq.gz"))
  names(filtFs) <- fns_paths[["sample_names"]]
  names(filtRs) <- fns_paths[["sample_names"]]
  
  count_denoise <- filterAndTrim(fns_paths$fnFs, filtFs, fns_paths$fnRs, filtRs, #truncLen=c(240,160),
                                 maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                                 compress=TRUE, multithread=FALSE) # multithread does not work on the RStudio server, I am not sure why
  return(list("filtFs"=filtFs, "filtRs"=filtRs, "count_denoise"=count_denoise))
}

DADA2 denoising

Information from the dada2 tutorial:

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
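The fitted error model is worth a visual sanity check; dada2's plotErrors() overlays the observed error rates against the fitted model. A sketch, assuming errF is the object returned by learnErrors() as in the denoise function below:

```r
library(dada2)

# Plot observed error rates (points) against the fitted error model (line);
# nominalQ = TRUE adds the rates expected from the quality scores alone
plotErrors(errF, nominalQ = TRUE)
```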

denoise
function(filt_paths) {
  filtFs <- filt_paths[["filtFs"]]
  filtRs <- filt_paths[["filtRs"]]
 
  errF <- learnErrors(filtFs, multithread=FALSE)
  errR <- learnErrors(filtRs, multithread=FALSE)
  
  dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
  dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
  return(list("dadaFs"=dadaFs, "dadaRs"=dadaRs))
}

Paired-end merging

Forward and reverse reads are now merged:

mergePairs(dadas[["dadaFs"]], filt_paths[["filtFs"]], 
                        dadas[["dadaRs"]], filt_paths[["filtRs"]], 
                        verbose=TRUE)

Checking and filtering sequence length

Here we check the length of the sequences after merging:

tar_load(fig_seqlength)
fig_seqlength

Wow, all ASVs are pretty much the same size; I don't think I need to filter further on length! I'll do it anyway, just to build the target for future use.

tar_load(seqtab_filt)
seuil <- nchar(rownames(seqtab_filt))

After filtering, all sequences are between 230 and 270 bp and the dataset has 3915 ASVs.
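For the record, the length filter itself is a one-liner; a minimal sketch, assuming the merged sequences are in a standard dada2 sequence table called seqtab (where ASV sequences are the column names):

```r
# Keep only ASVs whose sequence length falls in the expected V4 range
seqtab_filt <- seqtab[, nchar(colnames(seqtab)) %in% 230:270]
```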

Removing chimeras

rm(seqtab_filt)
tar_load(seqtab_filt_nochim)

After removing chimeras, the dataset has 1119 ASVs.
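The chimera-removal target wraps dada2's removeBimeraDenovo(); a minimal sketch, assuming seqtab_filt from the previous step:

```r
library(dada2)

# Remove chimeric ASVs by consensus across samples
seqtab_filt_nochim <- removeBimeraDenovo(seqtab_filt, method = "consensus",
                                         multithread = TRUE, verbose = TRUE)
```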

Removing low abundance ASVs

I removed ASVs with fewer than 1000 reads across the whole dataset: these are in vitro samples, so I expect every genuine SynCom member to be at high abundance, at least at the first time point of the experiment. But I might need to go back and change that later.

rm(seqtab_filt_nochim)
tar_load(seqtab_filt_nochim_abund)

After removing low abundance ASVs, the dataset has 44 ASVs.
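The abundance filter itself is a simple column subset; a minimal sketch, assuming seqtab_filt_nochim is a standard dada2 sequence table (samples in rows, ASVs in columns):

```r
# Keep ASVs with at least 1000 reads summed over all samples
seqtab_filt_nochim_abund <- seqtab_filt_nochim[, colSums(seqtab_filt_nochim) >= 1000]
```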

Read loss

We compute the number of reads at each step of the pipeline:

tar_load(fig_read_loss)
fig_read_loss

Assigning taxonomy with my in silico reference database

I first assigned taxonomy using 100% identity with the assignSpecies() function from dada2, but it seems there are errors in the sequences that were not detected by the denoising algorithm, so I am assigning taxonomy again with BLAST+ at 98% identity. To do this I am using rBLAST.
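The rBLAST step looks roughly like this; a minimal sketch, assuming a BLAST database was first built from the in silico reference with makeblastdb and that asv_seqs is a DNAStringSet of the ASV sequences (names and paths are placeholders):

```r
library(rBLAST)
library(Biostrings)

# Open the pre-built BLAST database (path is a placeholder)
bl <- blast(db = "insilico_reference/16S_db")

# Run blastn for every ASV sequence; returns one row of hit statistics per match
hits <- predict(bl, asv_seqs)

# Keep only matches at >= 98% identity
hits <- hits[hits$pident >= 98, ]
```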

tar_load(asv_metadata_taxo)
head(asv_metadata_taxo)
  ASV_ID
1  ASV_1
2  ASV_2
3  ASV_3
4  ASV_4
5  ASV_5
6  ASV_6
                                                                                                                                                                                                                                                       sequence
1 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
2 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGCGCTTAACGTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
3 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGAGCTTAACTTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
4 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGGCAAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
5 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
6 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGGCAAGCTAGAGTACAGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
  abundance prevalence  sseqid  pident length mismatch gapopen qstart qend
1   4862885        192  FOH354 100.000    253        0       0      1  253
2   2993893        192 FOS446B 100.000    253        0       0      1  253
3   2137064        192  FOS97B 100.000    253        0       0      1  253
4   1368243        190  FOS972  98.814    253        3       0      1  253
5    908920        192  FOH358 100.000    253        0       0      1  253
6    253107        159  FOS972  98.419    253        4       0      1  253
  sstart send    evalue bitscore multiple_matches
1    273   21 1.41e-135      468            FALSE
2     20  272 1.41e-135      468             TRUE
3    273   21 1.41e-135      468             TRUE
4    273   21 1.41e-130      451             TRUE
5    273   21 1.41e-135      468            FALSE
6    273   21 6.58e-129      446             TRUE

Session info

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.0 (2025-04-11)
 os       Debian GNU/Linux 13 (trixie)
 system   x86_64, linux-gnu
 ui       X11
 language en_GB:en
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/London
 date     2026-02-05
 pandoc   3.6.3 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
 quarto   1.8.25 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version date (UTC) lib source
 P backports      1.5.0   2024-05-23 [?] CRAN (R 4.5.0)
 P base64url      1.4     2018-05-14 [?] RSPM (R 4.5.0)
 P BiocManager    1.30.26 2025-06-05 [?] CRAN (R 4.5.0)
 P callr          3.7.6   2024-03-25 [?] RSPM (R 4.5.0)
 P cli            3.6.5   2025-04-23 [?] CRAN (R 4.5.0)
 P codetools      0.2-20  2024-03-31 [?] CRAN (R 4.3.3)
 P data.table     1.17.4  2025-05-26 [?] RSPM (R 4.5.0)
 P digest         0.6.37  2024-08-19 [?] CRAN (R 4.5.0)
 P dplyr          1.1.4   2023-11-17 [?] CRAN (R 4.5.0)
 P evaluate       1.0.3   2025-01-10 [?] CRAN (R 4.5.0)
 P farver         2.1.2   2024-05-13 [?] CRAN (R 4.5.0)
 P fastmap        1.2.0   2024-05-15 [?] CRAN (R 4.5.0)
 P generics       0.1.4   2025-05-09 [?] CRAN (R 4.5.0)
 P ggplot2        3.5.2   2025-04-09 [?] CRAN (R 4.5.0)
 P glue           1.8.0   2024-09-30 [?] CRAN (R 4.5.0)
 P gtable         0.3.6   2024-10-25 [?] CRAN (R 4.5.0)
 P htmltools      0.5.8.1 2024-04-04 [?] CRAN (R 4.5.0)
   htmlwidgets    1.6.4   2023-12-06 [1] RSPM (R 4.5.0)
 P igraph         2.1.4   2025-01-23 [?] CRAN (R 4.5.0)
 P jsonlite       2.0.0   2025-03-27 [?] CRAN (R 4.5.0)
 P knitr          1.50    2025-03-16 [?] CRAN (R 4.5.0)
 P labeling       0.4.3   2023-08-29 [?] CRAN (R 4.5.0)
 P lifecycle      1.0.4   2023-11-07 [?] CRAN (R 4.5.0)
 P magrittr       2.0.3   2022-03-30 [?] CRAN (R 4.5.0)
 P pillar         1.10.2  2025-04-05 [?] CRAN (R 4.5.0)
 P pkgconfig      2.0.3   2019-09-22 [?] CRAN (R 4.5.0)
 P prettyunits    1.2.0   2023-09-24 [?] RSPM (R 4.5.0)
 P processx       3.8.6   2025-02-21 [?] RSPM (R 4.5.0)
 P ps             1.9.1   2025-04-12 [?] RSPM (R 4.5.0)
 P R6             2.6.1   2025-02-15 [?] CRAN (R 4.5.0)
 P RColorBrewer   1.1-3   2022-04-03 [?] CRAN (R 4.5.0)
   renv           1.1.5   2025-07-24 [1] RSPM (R 4.5.0)
 P rlang          1.1.6   2025-04-11 [?] CRAN (R 4.5.0)
 P rmarkdown      2.29    2024-11-04 [?] CRAN (R 4.5.0)
 P rstudioapi     0.17.1  2024-10-22 [?] CRAN (R 4.5.0)
 P scales         1.4.0   2025-04-24 [?] CRAN (R 4.5.0)
 P secretbase     1.0.5   2025-03-04 [?] RSPM (R 4.5.0)
 P sessioninfo    1.2.3   2025-02-05 [?] CRAN (R 4.5.0)
 P targets      * 1.11.3  2025-05-08 [?] RSPM (R 4.5.0)
 P tibble         3.2.1   2023-03-20 [?] RSPM (R 4.5.0)
 P tidyselect     1.2.1   2024-03-11 [?] CRAN (R 4.5.0)
 P vctrs          0.6.5   2023-12-01 [?] CRAN (R 4.5.0)
   visNetwork     2.1.4   2025-09-04 [1] RSPM (R 4.5.0)
 P withr          3.0.2   2024-10-28 [?] CRAN (R 4.5.0)
 P xfun           0.52    2025-04-02 [?] CRAN (R 4.5.0)
 P yaml           2.3.10  2024-07-26 [?] CRAN (R 4.5.0)

 [1] /home/marine/Documents/1_Postdoc_Birmingham/1_Experiments/250730_SynCom_stability_experiment_data_processing/renv/library/linux-debian-trixie/R-4.5/x86_64-pc-linux-gnu
 [2] /home/marine/.cache/R/renv/sandbox/linux-debian-trixie/R-4.5/x86_64-pc-linux-gnu/9a444a72

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Update the renv snapshot.

renv::snapshot()
- Lockfile written to "~/Documents/1_Postdoc_Birmingham/1_Experiments/250730_SynCom_stability_experiment_data_processing/renv.lock".